-
Notifications
You must be signed in to change notification settings - Fork 685
[Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports #5415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…factor post-init and usage for all ports
|
Thanks for your contribution! |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #5415 +/- ##
==========================================
Coverage ? 62.08%
==========================================
Files ? 329
Lines ? 41287
Branches ? 6295
==========================================
Hits ? 25633
Misses ? 13701
Partials ? 1953
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors port configuration for PD-disaggregated deployments by introducing automatic port allocation and consolidating port management logic. The changes aim to simplify configuration by removing the need for users to manually specify multiple ports per service.
Key Changes
- Introduced automatic port discovery via
find_free_ports()function with validation inpost_init_all_ports() - Centralized port extraction logic in
FDConfig.postprocess()to set per-DP ports (local_engine_worker_queue_port, etc.) - Added RDMA environment auto-detection via new
get_rdma_nics.shscript
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| fastdeploy/utils.py | Added parse_ports(), find_free_ports(), and modified is_port_available() for port management |
| fastdeploy/engine/args_utils.py | Added post_init_all_ports() to automatically allocate ports; changed default cache_transfer_protocol to "ipc,rdma" |
| fastdeploy/config.py | Moved port extraction logic to postprocess() method; added local_engine_worker_queue_port attribute |
| fastdeploy/worker/worker_process.py | Moved port assignment from pre-config to post-config initialization |
| fastdeploy/engine/engine.py | Updated to use local_engine_worker_queue_port for DP queue services |
| fastdeploy/engine/common_engine.py | Simplified port handling by using local_engine_worker_queue_port directly |
| fastdeploy/engine/expert_service.py | Removed local port/device ID extraction logic (now handled in config) |
| fastdeploy/splitwise/splitwise_connector.py | Changed to use scalar pd_comm_port and local_device_ids |
| fastdeploy/cache_manager/prefix_cache_manager.py | Renamed pid_suffix to ipc_suffix; updated RDMA port usage |
| fastdeploy/cache_manager/cache_transfer_manager.py | Renamed engine_pid to ipc_suffix for clarity |
| fastdeploy/cache_manager/cache_messager.py | Renamed engine_pid to ipc_suffix for consistency |
| fastdeploy/cache_manager/transfer_factory/rdma_cache_transfer.py | Added automatic RDMA environment setup and NIC detection |
| fastdeploy/cache_manager/transfer_factory/get_rdma_nics.sh | New script for detecting RDMA-capable network interfaces |
| examples/splitwise/utils.sh | Improved health check display with UDP port checking |
| examples/splitwise/start_v1_tp1.sh | Simplified example by removing manual port specifications |
fastdeploy/cache_manager/transfer_factory/rdma_cache_transfer.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 16 comments.
Comments suppressed due to low confidence (1)
fastdeploy/engine/args_utils.py:557
- This comment appears to contain commented-out code.
# for port in ports:
# assert is_port_available("0.0.0.0", port), f"Parameter `{name}`:{port} is already in use."
juncaipeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Motivation
该 PR 旨在实现两个目标:
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.